TL;DR: I show how I identify the parts of an application that trigger unnecessary SQL statements by applying graph analysis to a call tree.
We don't need more tools that show us more problems in our software. We need ways to determine the right problem to solve. Too often, we just fix the symptoms on the surface rather than the underlying problems. I find that approach unprofessional and want to do my part to improve the situation by delivering root cause analyses of symptoms that get to the real problems in our software.
In this notebook, I'll show you one of my approaches for mining performance problems based on application runtime analysis. In general, I use this approach to make the point that there are severe design errors that negatively influence the application's overall performance. In this example, I show you step by step how I determine the reasons behind a massive number of executed SQL statements in an application.
The key idea is to use graph analysis on call stacks created by a profiling tool. With this approach, I'm not only able to show the hotspots that are involved in the performance issue (the hotspots show that some SQL statements take a long time to execute or are executed too often), but also the reasons behind these hotspots. I achieve this by extracting additional information such as the web requests, the application's entry points and the triggers within the application that cause the hotspots.
This is very helpful for determining the most critical parts of the application and gives you a hint where you could start improving immediately. I use this analysis at work to determine the biggest performance bottleneck in a medium-sized application (~700 kLOC). Based on the results, we work out possible improvements for that specific hotspot, create a prototypical fix for it, measure the fix's impact and, if the results are convincing, roll out the fix application-wide (then work on the next performance bottleneck, and so on).
I hope you'll see that this is a very reasonable approach despite the simplified use case that I show in this blog post/notebook.
Before we start, I want to briefly introduce you to the setup that I use for this analysis:
The first ones depend on the environment and programming language you use. jQAssistant, Neo4j and Pandas are my default environment for software analytics so far. I'll show you how all these tools fit together.
So let's get started!
*Actually, what you see here is the result of an executed Jupyter notebook, too. You can find that notebook on GitHub.
As a prerequisite for this analysis, we need performance profiling data gathered by a profiler. A profiler is integrated into the runtime environment (e.g. the Java Virtual Machine) of your application and measures diverse properties like method execution time, number of web service calls, executed SQL statements etc. Additionally, we need something that uses or clicks through our application to get some numbers. In my case, I run the Spring PetClinic performance test using JMeter.
As profiling tool, I use JProfiler to record some performance measures while the test was running.
<advertisement>
At this point, I want to thank ej-technologies for providing me with a [free open-source license](https://www.ej-technologies.com/buy/jprofiler/openSource) for JProfiler, which made this blog post possible, in exchange for mentioning their product:
![](https://www.ej-technologies.com/images/product_banners/jprofiler_large.png)
JProfiler is a great commercial tool for profiling Java applications and costs around 400 €. It's really worth the money because it gives you deep insights into how your application performs under the hood.
</advertisement>
Even outside the advertisement block, I personally like JProfiler a lot because it does what it does very, very well. Back to the article.
The recording of the measures starts before the performance test is executed and stops after the test has finished successfully. The result is stored in a file as a so-called "snapshot". Using a snapshot enables you to repeat your analysis over and over again with exactly the same performance measures.
What we usually need for performance analysis is a recorded runtime stack of all method calls, organized as a call tree, i.e. a tree of the called methods. Below, you can see the call tree of the called methods with their measured CPU wall clock time (aka the real time that is spent in each method) and the number of invocations for a complete test run:
With such a view, you can see which parts of your application call which classes and methods by drilling down the hierarchy by hand:
But there is more: You can also "turn around" the call tree and list all the so-called "HotSpots". Technically, e.g. for CPU HotSpots, JProfiler sums up all the measurements for the method call leaves that take longer than 0.1% of all method calls. With this view, you see the application's hotspots immediately:
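Conceptually, this "turning around" is a simple aggregation over the leaves of the call tree. Here is a minimal Python sketch with made-up method names and timings (not JProfiler's actual algorithm):

```python
from collections import defaultdict

# (call path, self time in ms) pairs - hypothetical measurements
measurements = [
    (("main", "Controller.show", "Repo.findAll"), 40),
    (("main", "Controller.edit", "Repo.findAll"), 30),
    (("main", "Controller.show", "Repo.findById"), 10),
]

# "turn around" the tree: aggregate the time by the leaf method of each path
hotspots = defaultdict(int)
for path, self_time in measurements:
    hotspots[path[-1]] += self_time

print(sorted(hotspots.items(), key=lambda kv: -kv[1]))
# [('Repo.findAll', 70), ('Repo.findById', 10)]
```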
These views are also available for other measures like web service calls, file accesses or database calls, as shown below:
This is the data that we need for our SQL statement analysis. The big problem is that we can't easily see where all those SQL statements come from, because we only see the isolated SQL statements.
And this is where our journey begins...
In [1]:
with open(r'input/spring-petclinic/JDBC_Probe_Hot_Spots_jmeter_test.xml') as log:
    # print the first ten lines, truncated to 97 characters each
    for line in log.readlines()[:10]:
        print(line[:97] + "...")
For a full version of this file, see GitHub: https://github.com/feststelltaste/software-analytics/blob/master/notebooks/input/spring-petclinic/JDBC_Probe_Hot_Spots_jmeter_test.xml
This file contains all the information that we've seen in the JProfiler GUI, but as XML elements and attributes. And here comes the great part: The content itself is graph-like because the XML elements are nested! The <tree> element contains the <hotspot> elements, which contain the <node> elements, and so on. A case for a graph database like Neo4j!
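To illustrate the graph-like nesting, here is a tiny Python sketch that walks a hypothetical excerpt of such an XML file and yields the containment relationships (the element names mirror the JProfiler export; the values are made up):

```python
import xml.etree.ElementTree as ET

# A minimal, hypothetical excerpt mirroring the export's structure
xml = """
<tree>
  <hotspot value="SELECT id, name FROM types ORDER BY name" count="107">
    <node class="org.example.Repo" methodName="findAll" count="107"/>
  </hotspot>
</tree>
"""

def edges(element, parent=None):
    """Yield (parent tag, child tag) pairs - the nesting as graph edges."""
    if parent is not None:
        yield (parent.tag, element.tag)
    for child in element:
        yield from edges(child, element)

root = ET.fromstring(xml)
print(list(edges(root)))  # [('tree', 'hotspot'), ('hotspot', 'node')]
```

Every nesting step becomes an edge, which is exactly the shape a graph database can store directly.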
But how do we get that XML file into our Neo4j database? jQAssistant to the rescue!
jQAssistant is a great and versatile tool for scanning various graph-like structured data into Neo4j (see my experiences with jQAssistant so far for more information). I just downloaded version 1.1.3, added the binary to my PATH system variable and executed the following command (works for jQAssistant versions prior to 1.2.0; I haven't figured out how to do it with newer versions yet):
jqassistant scan -f xml:document::JDBC_Probe_Hot_Spots_jmeter_test.xml
This will import the XML structure as a graph into the Neo4j graph database that is used by jQAssistant under the hood.
So, if we want to have a quick look at the stored data, we can start jQAssistant's Neo4j embedded instance via
jqassistant server
open http://localhost:7474, click on the label File in the overview menu, click on some nodes, and you will see something like this:
It shows the content of the XML file from above as a graph in Neo4j:
So, for example, the attribute named value contains the executed SQL statement:
The attribute named count contains the number of executions of a SQL statement:
Each element or attribute is also labeled correspondingly with Element or Attribute.
I want to show you the data from the database in a nicer way. So, we load our main libraries and initialize the connection to the Neo4j database by creating a Graph object (for more details on this, have a look at this blog post):
In [2]:
import pandas as pd
from py2neo import Graph
graph = Graph()
We execute a simple query for one XML element and its relationships to its attributes.
For example, if we want to display the data of this <hotspot> element
<hotspot
leaf="false"
value="SELECT id, name FROM types ORDER BY name"
time="78386"
count="107">
as a graph, we get that information from all the attributes of the element (don't worry about the syntax of the following two Cypher statements; they are just there to show you the underlying data as an example).
In [3]:
query="""
MATCH (e:Element)-[:HAS_ATTRIBUTE]->(a:Attribute)
WHERE a.value = "SELECT id, name FROM types ORDER BY name"
WITH e as node
MATCH (node)-[:HAS_ATTRIBUTE]->(all:Attribute)
RETURN all.name, all.value
"""
pd.DataFrame(graph.run(query).data())
Out[3]:
As seen in the picture with the huge graph above, each <hotspot> node refers to further <node> elements that call the hotspots. In our case, these nodes are the methods in our application that are responsible for the execution of the SQL statements.
If we list all attributes of such a node, we get plenty of information about the callers of the hotspots. For example, these nodes contain information about the method name (method) and the number of executed SQL statements (count):
In [4]:
query="""
MATCH (e:Element)-[:HAS_ATTRIBUTE]->(a:Attribute)
WHERE id(e) = 12 //just select an arbitrary node
RETURN a.name, a.value
"""
pd.DataFrame(graph.run(query).data())
Out[4]:
Before executing the first statements, we clean up any preexisting data from previous runs. This is only necessary when you execute this notebook several times on the same data store (like me). It makes the results repeatable and thus reproducible (a property we should generally strive for!).
In [5]:
query="""
MATCH (n:Node)-[r:CALLS|CREATED_FROM|LEADS_TO]->()
DELETE r, n
RETURN COUNT(r), COUNT(n)"""
graph.run(query).data()
Out[5]:
We create some new nodes that contain all the information from the XML part of the graph that we need. We simply copy the values of some attributes to new Call nodes.
In our Cypher query, we first retrieve all <node> elements (identified by their name property) and the attributes that we need for our analysis. For each relevant piece of information, we create a variable so we can retrieve the information later on:
MATCH (n:Element {name: "node"}),
(n)-[:HAS_ATTRIBUTE]->(classAttribut:Attribute {name : "class"}),
(n)-[:HAS_ATTRIBUTE]->(methodAttribut:Attribute {name : "methodName"}),
(n)-[:HAS_ATTRIBUTE]->(countAttribut:Attribute {name : "count"}),
(n)-[:HAS_ATTRIBUTE]->(timeAttribut:Attribute {name : "time"})
For each <node> element we've found, we tag the new nodes with the label Node as a general marker for the JProfiler measurements (which are called "node" by coincidence) and mark all nodes that contain information about the calling classes and methods with the label Call:
CREATE
(x:Node:Call {
We also copy the relevant information from the <node> element's attributes into the new nodes. We put the value of the class attribute (which consists of the Java package name and the class name) into the fqn (fully qualified name) property and add just the name of the class in the class property (purely for display purposes later on). The rest is copied as well, including some type conversions for count and time:
fqn: classAttribut.value,
class: SPLIT(classAttribut.value,".")[-1],
method: methodAttribut.value,
count: toFloat(countAttribut.value),
time: toFloat(timeAttribut.value)
})
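As a side note, Cypher's SPLIT(...)[-1] expression behaves just like Python's str.split, which makes it easy to sanity-check the class name extraction (the class below is just an example from Spring PetClinic):

```python
# Mirrors the Cypher expression SPLIT(classAttribut.value, ".")[-1]:
# take the last segment of the fully qualified name
fqn = "org.springframework.samples.petclinic.web.OwnerController"
print(fqn.split(".")[-1])  # OwnerController
```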
Additionally, we track the origin of the information with a CREATED_FROM relationship to be able to connect the new nodes later on:
-[r:CREATED_FROM]->(n)
So, the complete query looks like the following and is executed against the Neo4j data store:
In [6]:
query = """
MATCH (n:Element {name: "node"}),
(n)-[:HAS_ATTRIBUTE]->(classAttribut:Attribute {name : "class"}),
(n)-[:HAS_ATTRIBUTE]->(methodAttribut:Attribute {name : "methodName"}),
(n)-[:HAS_ATTRIBUTE]->(countAttribut:Attribute {name : "count"}),
(n)-[:HAS_ATTRIBUTE]->(timeAttribut:Attribute {name : "time"})
CREATE
(x:Node:Call {
fqn: classAttribut.value,
class: SPLIT(classAttribut.value,".")[-1],
method: methodAttribut.value,
count: toFloat(countAttribut.value),
time: toFloat(timeAttribut.value)
})-[r:CREATED_FROM]->(n)
RETURN COUNT(x), COUNT(r)
"""
graph.run(query).data()
Out[6]:
We do the same for the <hotspot> elements. Here, the attributes are a little bit different, because these hotspots contain information about the executed SQL statements:
In [7]:
query = """
MATCH (n:Element { name: "hotspot"}),
(n)-[:HAS_ATTRIBUTE]->(valueAttribut:Attribute {name : "value"}),
(n)-[:HAS_ATTRIBUTE]->(countAttribut:Attribute {name : "count"}),
(n)-[:HAS_ATTRIBUTE]->(timeAttribut:Attribute {name : "time"})
CREATE
(x:Node:HotSpot {
value: valueAttribut.value,
count: toFloat(countAttribut.value),
time: toFloat(timeAttribut.value)
})-[r:CREATED_FROM]->(n)
RETURN COUNT(x), COUNT(r)
"""
graph.run(query).data()
Out[7]:
Now, we have many new nodes in our database that aren't directly connected. E.g. a Call node looks like this:
So, let's connect them. How? We've saved that information with our CREATED_FROM relationship:
This information can be used to connect the Call nodes as well as the HotSpot nodes.
In [8]:
query="""
MATCH (outerNode:Node)-[:CREATED_FROM]->
(outerElement:Element)-[:HAS_ELEMENT]->
(innerElement:Element)<-[:CREATED_FROM]-(innerNode:Node)
CREATE (outerNode)<-[r:CALLS]-(innerNode)
RETURN COUNT(r)
"""
graph.run(query).data()
Out[8]:
And there we have it! Just click on the CALLS relationship in the Neo4j browser and you'll see our call tree from JProfiler as a call graph in Neo4j, ready for root cause analysis!
All the work before was just there to get a nice graph model that feels more natural. Now comes the analysis part: As mentioned in the introduction, we don't only want the hotspots that signal that something awkward happened, but also the web requests, entry points and triggers that lead to them.
Speaking in graph terms, we need specific nodes of our call tree graph: the incoming web request, the first call into our own application (the entry point), the last call of our application before the SQL statement gets executed (the trigger) and the executed SQL statement itself (the hotspot).
These points in the call tree should give us enough information that we can determine where to look for improvements in our application.
There is one thing that is a little bit tricky to implement: modeling what we mean by "last" and "first" in Neo4j / Cypher. Because we use the package name of a class to identify our own application, there are many Call nodes in one part of the call graph that carry that package name. Neo4j would (rightly) return too many results (for us) because it delivers one result for each match, and a match is also given whenever any Call node within our application has that package name. So, how do we mark the first and last nodes of our application code?
Well, let's take one step at a time. Before doing anything awkward, we store all the information that we already know in the graph. I always favor this approach over trying to find a solution with complicated Cypher queries, where you'll probably mix things up easily. First, we mark all nodes that represent incoming HTTP requests (JProfiler records them with the synthetic class name _jprofiler_annotation_class) with the label Request:
In [9]:
query="""
MATCH (x)
WHERE x.fqn = "_jprofiler_annotation_class" AND x.method STARTS WITH "HTTP"
SET x:Request
RETURN COUNT(x)
"""
graph.run(query).data()
Out[9]:
Next, we identify the entry points, aka the first nodes of our application. We achieve this by searching all shortest paths between the already existing Request nodes and all nodes that have our package name, and keeping only those paths that contain exactly one node with our application's package name. That single node is the first node that occurs in the call graph when starting from the request (I have a feeling that there is a more elegant way to do this; if so, please let me know!). We mark these nodes as Entry nodes.
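The SINGLE predicate in the query below can be illustrated with a small Python sketch (the call path is hypothetical; only our own package name is taken from the example application):

```python
# Keep only call paths that contain exactly one node from our own package -
# a Python rendering of Cypher's SINGLE(n IN NODES(path) WHERE ...) predicate.
OUR_PACKAGE = "org.springframework.samples.petclinic"

def is_entry_path(path_fqns):
    """True if exactly one node on the path belongs to our application."""
    return sum(fqn.startswith(OUR_PACKAGE) for fqn in path_fqns) == 1

# A hypothetical path from the HTTP request into the application
path = ["_jprofiler_annotation_class",
        "org.apache.catalina.connector.CoyoteAdapter",
        "org.springframework.samples.petclinic.web.OwnerController"]
print(is_entry_path(path))  # True
```

A path that already passed through two of our own classes would be rejected, which is exactly how we single out the first node of our application code.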
In [10]:
query="""
MATCH
request_to_entry=shortestPath((request:Request)-[:CALLS*]->(entry:Call))
WHERE
entry.fqn STARTS WITH "org.springframework.samples.petclinic"
AND
SINGLE(n IN NODES(request_to_entry)
WHERE exists(n.fqn) AND n.fqn STARTS WITH "org.springframework.samples.petclinic")
SET
entry:Entry
RETURN COUNT(entry)
"""
graph.run(query).data()
Out[10]:
With the same approach, we can mark all the calls that trigger the execution of the SQL statements with the label Trigger.
In [11]:
query="""
MATCH
trigger_to_hotspot=shortestPath((trigger:Call)-[:CALLS*]->(hotspot:HotSpot))
WHERE
trigger.fqn STARTS WITH "org.springframework.samples.petclinic"
AND
SINGLE(n IN NODES(trigger_to_hotspot)
WHERE exists(n.fqn) AND n.fqn STARTS WITH "org.springframework.samples.petclinic")
SET
trigger:Trigger
RETURN count(trigger)
"""
graph.run(query).data()
Out[11]:
After marking all the relevant nodes, we connect them via a new relationship, LEADS_TO, to enable more elegant queries later on.
In [12]:
query="""
MATCH
(request:Request)-[:CALLS*]->
(entry:Entry)-[:CALLS*]->
(trigger:Trigger)-[:CALLS*]->
(hotspot:HotSpot)
CREATE UNIQUE
(request)-[leads1:LEADS_TO]->
(entry)-[leads2:LEADS_TO]->
(trigger)-[leads3:LEADS_TO]->(hotspot)
RETURN count(leads1), count(leads2), count(leads3)
"""
graph.run(query).data()
Out[12]:
Now we can query the complete chain from the web request over the entry point and the trigger down to the executed SQL statement:
In [13]:
query="""
MATCH
(request:Request)-[:LEADS_TO]->
(entry:Entry)-[:LEADS_TO]->
(trigger:Trigger)-[:LEADS_TO]->
(hotspot:HotSpot)
RETURN
request.method as request,
request.count as sql_count,
entry.class as entry_class,
entry.method as entry_method,
trigger.class as trigger_class,
trigger.method as trigger_method,
hotspot.value as sql,
hotspot.count as sql_count_sum
"""
hotspots = pd.DataFrame(graph.run(query).data())
hotspots.head()
Out[13]:
The returned data consists of the web request and its SQL count, the entry point's class and method, the trigger's class and method, the executed SQL statement and the summed-up execution count.
If we take a look at the Neo4j browser and execute the statement from above, but return the nodes instead, i.e.
MATCH
(request:Request)-[:LEADS_TO]->
(entry:Entry)-[:LEADS_TO]->
(trigger:Trigger)-[:LEADS_TO]->
(hotspot:HotSpot)
RETURN
request, entry, trigger, hotspot
we get a nice overview of all our performance hotspots, e. g.
With this graphical view, it's easy to see the connection between the requests, our application code and the hotspots.
Although this view is nice for exploration, it's not actionable. So let's use Pandas to shape the data into knowledge!
In [14]:
sqls_per_method = hotspots.groupby([
'request',
'entry_class',
'entry_method',
'trigger_class',
'trigger_method']).agg(
{'sql_count' : 'sum',
'request' : 'count'})
sqls_per_method
Out[14]:
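To make the groupby/agg pattern above easier to follow, here is a miniature version with made-up values (it uses pandas' named aggregation to sum the SQL counts and count the rows per group, which avoids referencing a grouping column twice):

```python
import pandas as pd

# A toy version of the hotspots frame; request and method names as well as
# the counts are invented purely for illustration.
toy = pd.DataFrame({
    "request": ["GET /owners", "GET /owners", "GET /vets"],
    "trigger_method": ["loadPetsAndVisits", "loadPetsAndVisits", "getVets"],
    "sql_count": [107.0, 107.0, 1.0],
})

# Per (request, trigger_method): total SQL statements and number of rows
result = toy.groupby(["request", "trigger_method"]).agg(
    sql_count_sum=("sql_count", "sum"),
    calls=("sql_count", "count"))
print(result)
```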
We see immediately that we have an issue with the loading of the owners' pets via the OwnerController. Let's look at the problem from another perspective: what kind of data is loaded by whom from which tables? We simply chop up the SQL and extract just the name of the database table (in fact, the regex is so simple that some of the tables aren't identified, but because these are special cases, we can ignore them):
In [15]:
hotspots['table'] = hotspots['sql'].\
str.upper().str.extract(
r".*(FROM|INTO|UPDATE) ([\w\.]*)",
expand=True)[1]
hotspots['table'].value_counts()
Out[15]:
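To see what the regex actually extracts, here is a quick check on a few hypothetical statements:

```python
import pandas as pd

# Made-up SQL statements to exercise the FROM / INTO / UPDATE branches
sqls = pd.Series([
    "SELECT id, name FROM types ORDER BY name",
    "INSERT INTO visits VALUES (1)",
    "UPDATE owners SET city = 'x'",
])

# Same extraction as above: uppercase, then grab the word after the keyword
tables = sqls.str.upper().str.extract(
    r".*(FROM|INTO|UPDATE) ([\w\.]*)",
    expand=True)[1]
print(tables.tolist())  # ['TYPES', 'VISITS', 'OWNERS']
```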
And group the hotspots accordingly.
In [16]:
grouped_by_entry_class_and_table = hotspots.groupby(['entry_class', 'table'])[['sql_count']].sum()
grouped_by_entry_class_and_table
Out[16]:
Now we've made the problem more obvious: The class OwnerController works heavily with the PETS table and the pet's TYPES table. Surely an error in our program.
Let's visualize the problem with a nice donut chart in Bokeh:
In [17]:
from bokeh.charts import Donut, show, output_notebook
plot_data = grouped_by_entry_class_and_table.reset_index()
d = Donut(plot_data, label=['entry_class', 'table'],
values='sql_count',
text_font_size='8pt',
hover_text='sql_count'
)
output_notebook()
show(d)
Out[17]:
You can also have a look at the most problematic spot in the application by grouping the data by the class and method that trigger the execution of the most SQL statements.
In [18]:
hotspots.groupby(['trigger_class', 'trigger_method'])[['sql_count']].sum().sort_values(
by='sql_count', ascending=False).head(5)
Out[18]:
Again, the JdbcOwnerRepositoryImpl is heavily interacting with the database.
This gives you very actionable information on where to start improving!
The overall result of this analysis clearly points to an application problem: a sign of the famous N+1 query problem that's almost typical for any application that uses a database without care.
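For readers who haven't met the N+1 query problem before, here is a toy illustration with a fake query log (the statements are made up, not PetClinic's actual SQL):

```python
# A fake "database" that just logs every statement it receives
executed = []

def query(sql):
    """Pretend to hit the database: log the statement, return fake rows."""
    executed.append(sql)
    return [{"id": i} for i in range(3)]

# N+1 style: one statement for the owners, then one more per owner
owners = query("SELECT id FROM owners")
for owner in owners:
    query(f"SELECT * FROM pets WHERE owner_id = {owner['id']}")
n_plus_one = len(executed)
print(n_plus_one)  # 4 statements for just 3 owners

# Join style: a single statement fetches owners and pets together
executed.clear()
query("SELECT * FROM owners LEFT OUTER JOIN pets ON pets.owner_id = owners.id")
print(len(executed))  # 1 statement
```

With N owners, the first variant issues N+1 statements while the join issues exactly one, which is why the statement count explodes under load.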
If you look at the corresponding source code, you'll find spots in your application like the one below in the JdbcOwnerRepositoryImpl class:
public void loadPetsAndVisits(final Owner owner) {
Map<String, Object> params = new HashMap<>();
params.put("id", owner.getId());
final List<JdbcPet> pets = this.namedParameterJdbcTemplate.query(
    "SELECT pets.id, name, birth_date, type_id, owner_id, " +
    "visits.id as visit_id, visit_date, description, pet_id " +
    "FROM pets LEFT OUTER JOIN visits ON pets.id = pet_id " +
    "WHERE owner_id=:id", params, new JdbcPetVisitExtractor());
Collection<PetType> petTypes = getPetTypes();
for (JdbcPet pet : pets) {
pet.setType(EntityUtils.getById(petTypes, PetType.class, pet.getTypeId()));
owner.addPet(pet);
}
}
For example, we see data that isn't needed in the UI but is loaded from the database, and this loading happens within a loop. This can surely be improved!
This approach for mining performance problems follows a simple pattern: import the profiling data as a graph, enrich the graph with the information you already know, mark the relevant nodes and query along the newly created relationships.
With these steps, you quickly drill down to the root cause of problems.
But there are also limitations:
Overall, I'm really happy with this approach because it immediately shows me the performance pain points in our application.
I hope that you take away many new ideas for your analysis of software data:
The main idea was to load a huge graph with all the information and to reduce it to just the specific information we need for action. In general, I use this approach to mine my way through any graph-like data format such as XML files or other software structures.
Last but not least: Many thanks to my student Andreas, who had to prototype some of my crazy graph query ideas :-)
And as always: I'll be happy to answer your questions and please let me know what you think about this approach!